Skip to content

Harden CPU overuse warning checks and add unit tests#5512

Merged
adamnovak merged 3 commits into
masterfrom
issues/5478-cpu-usage-warnings
May 14, 2026
Merged

Harden CPU overuse warning checks and add unit tests#5512
adamnovak merged 3 commits into
masterfrom
issues/5478-cpu-usage-warnings

Conversation

@avnig05
Copy link
Copy Markdown
Collaborator

@avnig05 avnig05 commented May 5, 2026

Moves CPU-overuse warning evaluation into Job._check_cpu_usage() so the logic is centralized and easier to test. The check skips warnings when wall-clock runtime is non-positive, warns when an invalid core request is encountered (cores <= 0), and only gives a warning when CPU consumption exceeds the configured threshold factor.

Changelog Entry

To be copied to the draft changelog by merger:

  • PR submitter writes their recommendation for a changelog entry here

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
  • Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@avnig05 avnig05 requested a review from adamnovak May 5, 2026 19:16
Copy link
Copy Markdown
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good here.

The tests seem like they were written to pin down exactly what the implementation happens to do right now, rather than what is important about what it promises to do now and in the future. In general we want to test interfaces and not implementation details. For example, if we decide to reword the message to say used XXX times as much CPU as requested instead of used XXXx more CPU than the requested X cores, we don't want the tests to start to fail.

One way to avoid this is to write the bulk of the tests before writing the implementation. If generating tests, it might make sense to provide the generator only with the documentation/signature for the function, and avoid showing it the body, or at least send along exhortations to ignore it.



class TestJob:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Comment thread src/toil/test/src/jobTest.py Outdated
job = Job()
file_store = Mock()
job._check_cpu_usage(
job_time=0.0, job_cpu_time=100.0, fileStore=file_store)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The layout is inconsistent here for no reason I can see.

Suggested change
job_time=0.0, job_cpu_time=100.0, fileStore=file_store)
job_time=0.0, job_cpu_time=100.0, fileStore=file_store
)

Comment thread src/toil/test/src/jobTest.py Outdated
Comment on lines +833 to +844
with subtests.test("warn for invalid cores value"):
job = Job()
file_store = Mock()
job.cores = 0
job._check_cpu_usage(
job_time=10.0, job_cpu_time=10.0, fileStore=file_store
)
file_store.log_to_leader.assert_called_once()
invalid_cores_message = file_store.log_to_leader.call_args.args[0]
invalid_cores_level = file_store.log_to_leader.call_args.kwargs["level"]
assert "invalid requested cores value 0" in invalid_cores_message
assert invalid_cores_level == logging.WARNING
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point that there ought to be a warning for 0 cores is not documented as part of the function's docstring, so it seems odd to me to test it. We end up with a test that just constrains the function to do what it currently happens to do, rather than what it promises to do. In a perfect world, we ought to be able to delete the body of a function and reconstruct a version of it that passes all the tests from just the documentation.

I would do something like this, where we make sure we can run the function through with 0 cores but we are agnostic as to whether it complains or not.

Suggested change
with subtests.test("warn for invalid cores value"):
job = Job()
file_store = Mock()
job.cores = 0
job._check_cpu_usage(
job_time=10.0, job_cpu_time=10.0, fileStore=file_store
)
file_store.log_to_leader.assert_called_once()
invalid_cores_message = file_store.log_to_leader.call_args.args[0]
invalid_cores_level = file_store.log_to_leader.call_args.kwargs["level"]
assert "invalid requested cores value 0" in invalid_cores_message
assert invalid_cores_level == logging.WARNING
with subtests.test("accepts 0 cores"):
job = Job()
file_store = Mock()
job.cores = 0
job._check_cpu_usage(
job_time=10.0, job_cpu_time=10.0, fileStore=file_store
)

We could alternatively add the point that the function promises to warn on 0 cores to its documentation.

Comment on lines +846 to +865
with subtests.test("threshold boundary does not warn"):
threshold_job = Job(cores=2)
threshold_store = Mock()
threshold_job._check_cpu_usage(
job_time=10.0, job_cpu_time=21.0, fileStore=threshold_store
)
threshold_store.log_to_leader.assert_not_called()

with subtests.test("just above threshold warns"):
threshold_job = Job(cores=2)
threshold_store = Mock()
threshold_job._check_cpu_usage(
job_time=10.0, job_cpu_time=21.1, fileStore=threshold_store
)
threshold_store.log_to_leader.assert_called_once()
cpu_message = threshold_store.log_to_leader.call_args.args[0]
cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"]
assert "used " in cpu_message
assert "more CPU than the requested 2 cores" in cpu_message
assert cpu_level == logging.WARNING
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are testing both the behavior at the threshold and the default threshold value, and we don't promise that the threshold will always be exactly 1.05.

It's also not immediately clear that the magic numbers have the right relationship for these tests to work.

I would do something like:

Suggested change
with subtests.test("threshold boundary does not warn"):
threshold_job = Job(cores=2)
threshold_store = Mock()
threshold_job._check_cpu_usage(
job_time=10.0, job_cpu_time=21.0, fileStore=threshold_store
)
threshold_store.log_to_leader.assert_not_called()
with subtests.test("just above threshold warns"):
threshold_job = Job(cores=2)
threshold_store = Mock()
threshold_job._check_cpu_usage(
job_time=10.0, job_cpu_time=21.1, fileStore=threshold_store
)
threshold_store.log_to_leader.assert_called_once()
cpu_message = threshold_store.log_to_leader.call_args.args[0]
cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"]
assert "used " in cpu_message
assert "more CPU than the requested 2 cores" in cpu_message
assert cpu_level == logging.WARNING
TIME = 10
THRESHOLD = 1.05
CORES = 2
with subtests.test("threshold boundary does not warn"):
threshold_job = Job(cores=CORES)
threshold_store = Mock()
threshold_job._check_cpu_usage(
job_time=TIME,
job_cpu_time=TIME * CORES * THRESHOLD,
fileStore=threshold_store,
threshold_factor=THRESHOLD,
)
threshold_store.log_to_leader.assert_not_called()
with subtests.test("just above threshold warns"):
threshold_job = Job(cores=CORES)
threshold_store = Mock()
threshold_job._check_cpu_usage(
job_time=TIME,
job_cpu_time=TIME * CORES * THRESHOLD + 0.1,
fileStore=threshold_store,
threshold_factor=THRESHOLD,
)
threshold_store.log_to_leader.assert_called_once()
cpu_message = threshold_store.log_to_leader.call_args.args[0]
cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"]
assert str(threshold_job.description) in cpu_message
assert cpu_level == logging.WARNING

(I'm also de-constraining the message text here, to test whether a warning that mentions the job is sent, rather than the specific wording of the message.)

Comment thread src/toil/job.py
Comment on lines +3340 to +3344
If *job_time* is non-positive, returns immediately (wall-clock time may
lack resolution for very short jobs). If *cores* is non-positive, logs
a warning that overuse cannot be assessed and returns, avoiding
division by zero. Otherwise, compares *job_cpu_time* to
``job_time * cores * threshold_factor``.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually I think we put parameter names in double-backticks or just don't set them off, but this looks more like how the Python official docs do it.

This might be a little more like just the algorithm rewritten in English than we want it to be, but it's probably fine.

@adamnovak adamnovak merged commit cd59ec9 into master May 14, 2026
3 checks passed
@adamnovak adamnovak deleted the issues/5478-cpu-usage-warnings branch May 14, 2026 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants